Simplify checkpointer and make it work for large models #628
Conversation
@ali-ramadhan I will wait until part 1/3 is merged to review this.
Codecov Report
@@ Coverage Diff @@
## master #628 +/- ##
==========================================
- Coverage 75.13% 70.45% -4.68%
==========================================
Files 118 118
Lines 2280 2288 +8
==========================================
- Hits 1713 1612 -101
- Misses 567 676 +109
Continue to review full report at Codecov.
A couple questions / comments:
This PR finally upgrades the checkpointer so it can restore large models that take up more than 50% of system memory. It used to create a model then restore the fields which allocates twice as much memory as needed.
This is only relevant for CPU models --- correct? In other words, when running on the GPU we still need to load data onto the CPU and then transfer to the GPU. Previously, the option was available to create a model on the GPU, load data on to the CPU, and then copy that data to the GPU --- right? Or am I missing something?
edit: for example:
julia> using CuArrays
julia> a = CuArray{Float64}(undef, 10, 10); b = rand(10, 10);
julia> copyto!(a, b)
Does this PR impact the user API for checkpointing at all, or does it just change restore_from_checkpoint, which does everything behind the scenes?
We may want to move the section on checkpointing that is currently in the documentation at Model setup > Output writers to its own section within the documentation.
That is technically true, although in general you tend to have a lot more CPU memory than GPU memory, so I suspect it won't be an issue there. You could reduce memory allocation even further by loading fields from disk into a temporary array, then copying them to CuArrays one field at a time. But I don't think that will work with JLD2, as the entire file is loaded into memory at once. You could do it with chunked NetCDF files, for example, where you read specific chunks into memory one at a time.
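For the NetCDF route, a minimal sketch of what that chunked restore could look like (hypothetical file and field names; assumes NCDatasets.jl for reading, and that `gpu_fields` maps each name to a preallocated CuArray):

```julia
using NCDatasets, CuArrays

# Hypothetical sketch: restore one field at a time, so only a single
# CPU-side buffer is live at any moment instead of the whole file.
ds = NCDataset("checkpoint.nc")           # hypothetical checkpoint file
for name in ("u", "v", "w", "T", "S")     # hypothetical field names
    chunk = Array(ds[name][:, :, :])      # read just this field into CPU memory
    copyto!(gpu_fields[name], chunk)      # host-to-device copy into the existing CuArray
end
close(ds)
```

The peak extra CPU memory here is one field's worth of data rather than the whole checkpoint.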
You could do that but no user would have gone through the trouble.
Just changes what happens behind the scenes. The API is the same but checkpoint files now have file names like
Sounds like a good idea, will do before merging.
Well, fair, I was more commenting that "we" could have done that behind the scenes for the GPU case.
I'm still confused about how this changes what was previously done. Did we previously create arrays on the GPU, copy checkpoint data from the CPU into "temporary" GPU arrays, and then copy from the temporary GPU arrays into the previously-instantiated model data? Or am I missing something (is it possible to load data from disk directly onto the GPU)?
Yeah, this is what we were previously doing, which is kind of wasteful because you create a temporary copy of every field. Now we construct the fields with the restored data and pass them to the model constructor, so there are no temporary arrays and zero unnecessary allocations.
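As a rough sketch of the difference (simplified, with hypothetical constructor signatures, not the actual Oceananigans API):

```julia
using JLD2, FileIO

# Before: build a model (allocating fresh field arrays), then load the
# checkpoint (allocating a second copy) and overwrite the model's data.
model = Model(grid=grid)                 # first allocation of every field
restored = load("checkpoint.jld2")       # second allocation of the same data
model.velocities.u.data .= restored["u"] # copy, then wait for GC to free `restored`

# After: wrap the restored data in fields and hand them to the constructor,
# so each field's data is allocated exactly once.
restored = load("checkpoint.jld2")
u = Field(grid, restored["u"])           # field wraps the restored array directly
model = Model(grid=grid, velocities=(u=u,))
```

This is why the old path needed roughly twice the model's memory during a restore, while the new path does not.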
Hmmm, not sure, but it doesn't seem impossible. At the lowest level it'll have to do some host-to-device copies though, I think. @leios or @vchuravy would know.
This PR finally upgrades the checkpointer so it can restore large models that take up more than 50% of system memory. It used to create a model then restore the fields, which allocates twice as much memory as needed. Now the data needed to restore the fields is passed to the model constructor, so there is no double allocation. Some refactoring had to happen to make this possible.
This PR is also part 2/3 of making boundary conditions a field property.
Should help a lot with #602 and #603.
Resolves #416
Resolves #417
Note: This PR branches off #627.